Duplication is the presence of identical data within the same dataset. Duplicates can bias analysis results or make them inaccurate. Duplicates in data can be classified as follows:

| Duplication Type | Subtype | Description |
|-----------|-------------------|----------------------|
| Row duplication | Partial-column duplication | Two or more rows share the same values in some of their columns |
| | Full-column duplication | Two or more rows share the same values in all of their columns |
| Column duplication | Duplicate values | Two or more columns share the same values in every row |
| | Duplicate index | Two or more columns appear to hold different values, but every value in one column maps consistently to a value in the other |
# pip install pandas
# pip install numpy
import pandas as pd
import numpy as np
df = pd.DataFrame({'visitor_name':['Rian','Joni','Lika','Rian','Bima'],
'age':[25,23,18,25,25],
'event_name':['Rock Concert 2023','Jazz Festival 2023','Pop Party 2022','Rock Concert 2023','Pop Party 2022'],
'ticket_sold_year':[2023,2023,2022,2023,2022],
'event_year':[2023,2023,2022,2023,2022],
'ticket_type':['Gold','Silver','Bronze','Silver','Silver'],
'ticket_price':[100000.0,50000.0,30000.0,50000.0,60000.0],
'reward_point':[400,350,800,200,250]})
df['age'] = df['age'].astype('Int64')
df['ticket_sold_year'] = df['ticket_sold_year'].astype('Int64')
df['event_year'] = df['event_year'].astype('Int64')
df['ticket_price'] = df['ticket_price'].astype(float)
df['reward_point'] = df['reward_point'].astype('Int64')
df
| | visitor_name | age | event_name | ticket_sold_year | event_year | ticket_type | ticket_price | reward_point |
|---|---|---|---|---|---|---|---|---|
| 0 | Rian | 25 | Rock Concert 2023 | 2023 | 2023 | Gold | 100000.0 | 400 |
| 1 | Joni | 23 | Jazz Festival 2023 | 2023 | 2023 | Silver | 50000.0 | 350 |
| 2 | Lika | 18 | Pop Party 2022 | 2022 | 2022 | Bronze | 30000.0 | 800 |
| 3 | Rian | 25 | Rock Concert 2023 | 2023 | 2023 | Silver | 50000.0 | 200 |
| 4 | Bima | 25 | Pop Party 2022 | 2022 | 2022 | Silver | 60000.0 | 250 |
To control which duplicated rows are flagged (including or excluding the first occurrence), use the `keep` parameter. `keep` determines which rows are marked as duplicates and can be "first", "last", or False.
keep = "first": all duplicates except the first occurrence are marked as duplicates
keep = "last": all duplicates except the last occurrence are marked as duplicates
keep = False: all duplicates are marked as duplicates
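The three `keep` settings can be compared side by side on a tiny, hypothetical Series:

```python
import pandas as pd

# A minimal illustration of how `keep` changes which rows are flagged.
s = pd.Series(['a', 'b', 'a', 'a'])

first = s.duplicated(keep='first')    # flag every 'a' except the first
last = s.duplicated(keep='last')      # flag every 'a' except the last
all_dups = s.duplicated(keep=False)   # flag every 'a'

print(first.tolist())     # [False, False, True, True]
print(last.tolist())      # [True, False, True, False]
print(all_dups.tolist())  # [True, False, True, True]
```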
# drop all duplicate rows except the first occurrence
df1 = df.copy()
df1.drop_duplicates(subset='visitor_name', keep='first', inplace=True)
df1
| | visitor_name | age | event_name | ticket_sold_year | event_year | ticket_type | ticket_price | reward_point |
|---|---|---|---|---|---|---|---|---|
| 0 | Rian | 25 | Rock Concert 2023 | 2023 | 2023 | Gold | 100000.0 | 400 |
| 1 | Joni | 23 | Jazz Festival 2023 | 2023 | 2023 | Silver | 50000.0 | 350 |
| 2 | Lika | 18 | Pop Party 2022 | 2022 | 2022 | Bronze | 30000.0 | 800 |
| 4 | Bima | 25 | Pop Party 2022 | 2022 | 2022 | Silver | 60000.0 | 250 |
# drop all duplicate rows except the last occurrence
df2 = df.copy()
df2.drop_duplicates(subset='visitor_name', keep='last', inplace=True)
df2
| | visitor_name | age | event_name | ticket_sold_year | event_year | ticket_type | ticket_price | reward_point |
|---|---|---|---|---|---|---|---|---|
| 1 | Joni | 23 | Jazz Festival 2023 | 2023 | 2023 | Silver | 50000.0 | 350 |
| 2 | Lika | 18 | Pop Party 2022 | 2022 | 2022 | Bronze | 30000.0 | 800 |
| 3 | Rian | 25 | Rock Concert 2023 | 2023 | 2023 | Silver | 50000.0 | 200 |
| 4 | Bima | 25 | Pop Party 2022 | 2022 | 2022 | Silver | 60000.0 | 250 |
# drop all duplicate rows
df3 = df.copy()
df3.drop_duplicates(subset='visitor_name', keep=False, inplace=True)
df3
| | visitor_name | age | event_name | ticket_sold_year | event_year | ticket_type | ticket_price | reward_point |
|---|---|---|---|---|---|---|---|---|
| 1 | Joni | 23 | Jazz Festival 2023 | 2023 | 2023 | Silver | 50000.0 | 350 |
| 2 | Lika | 18 | Pop Party 2022 | 2022 | 2022 | Bronze | 30000.0 | 800 |
| 4 | Bima | 25 | Pop Party 2022 | 2022 | 2022 | Silver | 60000.0 | 250 |
df = pd.DataFrame({'visitor_name':['Rian','Joni','Lika','Rian','Bima'],
'age':[25,23,18,25,25],
'event_name':['Rock Concert 2023','Jazz Festival 2023','Pop Party 2022','Rock Concert 2023','Pop Party 2022'],
'ticket_sold_year':[2023,2023,2022,2023,2022],
'event_year':[2023,2023,2022,2023,2022],
'ticket_type':['Gold','Silver','Bronze','Gold','Silver'],
'ticket_price':[100000.0,50000.0,30000.0,100000.0,60000.0],
'reward_point':[400,350,800,400,250]})
df['age'] = df['age'].astype('Int64')
df['ticket_sold_year'] = df['ticket_sold_year'].astype('Int64')
df['event_year'] = df['event_year'].astype('Int64')
df['ticket_price'] = df['ticket_price'].astype(float)
df['reward_point'] = df['reward_point'].astype('Int64')
df
| | visitor_name | age | event_name | ticket_sold_year | event_year | ticket_type | ticket_price | reward_point |
|---|---|---|---|---|---|---|---|---|
| 0 | Rian | 25 | Rock Concert 2023 | 2023 | 2023 | Gold | 100000.0 | 400 |
| 1 | Joni | 23 | Jazz Festival 2023 | 2023 | 2023 | Silver | 50000.0 | 350 |
| 2 | Lika | 18 | Pop Party 2022 | 2022 | 2022 | Bronze | 30000.0 | 800 |
| 3 | Rian | 25 | Rock Concert 2023 | 2023 | 2023 | Gold | 100000.0 | 400 |
| 4 | Bima | 25 | Pop Party 2022 | 2022 | 2022 | Silver | 60000.0 | 250 |
df.duplicated()
0    False
1    False
2    False
3     True
4    False
dtype: bool
# drop all duplicate rows, keeping one occurrence of each
df.drop_duplicates(inplace=True)
df
| | visitor_name | age | event_name | ticket_sold_year | event_year | ticket_type | ticket_price | reward_point |
|---|---|---|---|---|---|---|---|---|
| 0 | Rian | 25 | Rock Concert 2023 | 2023 | 2023 | Gold | 100000.0 | 400 |
| 1 | Joni | 23 | Jazz Festival 2023 | 2023 | 2023 | Silver | 50000.0 | 350 |
| 2 | Lika | 18 | Pop Party 2022 | 2022 | 2022 | Bronze | 30000.0 | 800 |
| 4 | Bima | 25 | Pop Party 2022 | 2022 | 2022 | Silver | 60000.0 | 250 |
df = pd.DataFrame({'visitor_name':['Rian','Joni','Tara','Lika','Bima'],
'age':[25,23,18,21,25],
'event_name':['Rock Concert 2023','Jazz Festival 2022','Pop Party 2021','Rock Concert 2023','Pop Party 2021'],
'ticket_sold_year':[2023,2022,2021,2023,2021],
'event_year':[2023,2022,2021,2023,2021],
'ticket_type':['Gold','Silver','Bronze','Gold','Silver'],
'ticket_price':[100000.0,50000.0,30000.0,100000.0,60000.0],
'reward_point':[400,350,800,100,100]})
df['age'] = df['age'].astype('Int64')
df['ticket_sold_year'] = df['ticket_sold_year'].astype('Int64')
df['event_year'] = df['event_year'].astype('Int64')
df['ticket_price'] = df['ticket_price'].astype(float)
df['reward_point'] = df['reward_point'].astype('Int64')
df
| | visitor_name | age | event_name | ticket_sold_year | event_year | ticket_type | ticket_price | reward_point |
|---|---|---|---|---|---|---|---|---|
| 0 | Rian | 25 | Rock Concert 2023 | 2023 | 2023 | Gold | 100000.0 | 400 |
| 1 | Joni | 23 | Jazz Festival 2022 | 2022 | 2022 | Silver | 50000.0 | 350 |
| 2 | Tara | 18 | Pop Party 2021 | 2021 | 2021 | Bronze | 30000.0 | 800 |
| 3 | Lika | 21 | Rock Concert 2023 | 2023 | 2023 | Gold | 100000.0 | 100 |
| 4 | Bima | 25 | Pop Party 2021 | 2021 | 2021 | Silver | 60000.0 | 100 |
!pip install fast-ml
Requirement already satisfied: fast-ml in /opt/conda/lib/python3.10/site-packages (3.68)
from fast_ml.feature_selection import get_duplicate_features
df1 = df.copy()
# find columns with duplicated values or a duplicated index
duplicate_features = get_duplicate_features(df1)
print('Duplicate columns:\n')
print(duplicate_features)
Duplicate columns:

               Desc          feature1          feature2
0  Duplicate Values  ticket_sold_year        event_year
1   Duplicate Index        event_name  ticket_sold_year
2   Duplicate Index        event_name        event_year
# collect all duplicate-value columns as a list
duplicate_features_list = duplicate_features.loc[duplicate_features['Desc']=='Duplicate Values', 'feature2'].to_list()
print('\nList of duplicate-value columns:\n')
print(duplicate_features_list)
List of duplicate-value columns: ['event_year']
# drop all duplicate-value columns
print('\nNumber of columns before dropping duplicate-value columns: ' + str(df1.shape[1]))
df1.drop(columns=duplicate_features_list, inplace=True)
print('Number of columns after dropping duplicate-value columns: ' + str(df1.shape[1]))
df1
Number of columns before dropping duplicate-value columns: 8
Number of columns after dropping duplicate-value columns: 7
| | visitor_name | age | event_name | ticket_sold_year | ticket_type | ticket_price | reward_point |
|---|---|---|---|---|---|---|---|
| 0 | Rian | 25 | Rock Concert 2023 | 2023 | Gold | 100000.0 | 400 |
| 1 | Joni | 23 | Jazz Festival 2022 | 2022 | Silver | 50000.0 | 350 |
| 2 | Tara | 18 | Pop Party 2021 | 2021 | Bronze | 30000.0 | 800 |
| 3 | Lika | 21 | Rock Concert 2023 | 2023 | Gold | 100000.0 | 100 |
| 4 | Bima | 25 | Pop Party 2021 | 2021 | Silver | 60000.0 | 100 |
df2 = df.copy()
# find columns with duplicated values or a duplicated index
duplicate_features = get_duplicate_features(df2)
print('Duplicate columns:\n')
print(duplicate_features)
Duplicate columns:

               Desc          feature1          feature2
0  Duplicate Values  ticket_sold_year        event_year
1   Duplicate Index        event_name  ticket_sold_year
2   Duplicate Index        event_name        event_year
# collect all duplicate-index columns as a list
duplicate_index_features_list = duplicate_features.loc[duplicate_features['Desc']=='Duplicate Index', 'feature2'].to_list()
print('\nList of duplicate-index columns:\n')
print(duplicate_index_features_list)
List of duplicate-index columns: ['ticket_sold_year', 'event_year']
# drop all duplicate-index columns
print('\nNumber of columns before dropping duplicate-index columns: ' + str(df2.shape[1]))
df2.drop(columns=duplicate_index_features_list, inplace=True)
print('Number of columns after dropping duplicate-index columns: ' + str(df2.shape[1]))
df2
Number of columns before dropping duplicate-index columns: 8
Number of columns after dropping duplicate-index columns: 6
| | visitor_name | age | event_name | ticket_type | ticket_price | reward_point |
|---|---|---|---|---|---|---|
| 0 | Rian | 25 | Rock Concert 2023 | Gold | 100000.0 | 400 |
| 1 | Joni | 23 | Jazz Festival 2022 | Silver | 50000.0 | 350 |
| 2 | Tara | 18 | Pop Party 2021 | Bronze | 30000.0 | 800 |
| 3 | Lika | 21 | Rock Concert 2023 | Gold | 100000.0 | 100 |
| 4 | Bima | 25 | Pop Party 2021 | Silver | 60000.0 | 100 |
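Duplicate-value columns can also be found without fast-ml, using only pandas: transposing the frame turns duplicate columns into duplicate rows, so `duplicated()` applies again. A small sketch on hypothetical data:

```python
import pandas as pd

# Duplicate columns become duplicate rows after a transpose.
df = pd.DataFrame({
    'ticket_sold_year': [2023, 2022, 2021],
    'event_year': [2023, 2022, 2021],      # identical to ticket_sold_year
    'ticket_price': [100000.0, 50000.0, 30000.0],
})

dup_cols = df.columns[df.T.duplicated()].tolist()
print(dup_cols)  # ['event_year']
```

Note that this catches duplicate values only; duplicate-index columns (consistent one-to-one mappings between different values) need extra logic such as fast-ml's.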
An outlier is a value that differs markedly from the other values in a dataset; it appears as an exception to the prevailing pattern. An outlier can be far higher or far lower than the other values in the dataset. Outliers can occur for various reasons, including measurement errors, rare events, or other unexpected factors. They can skew statistical analyses and predictive models: if outliers are not identified and handled, they distort the general pattern in the data and lead to incorrect conclusions.
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
# Load Dataset
diabetics = load_diabetes()
# Build the DataFrame
column_name = diabetics.feature_names
df_diabetics = pd.DataFrame(diabetics.data)
df_diabetics.columns = column_name
print(df_diabetics.head())
age sex bmi bp s1 s2 s3 \
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142
s4 s5 s6
0 -0.002592 0.019907 -0.017646
1 -0.039493 -0.068332 -0.092204
2 -0.002592 0.002861 -0.025930
3 0.034309 0.022688 -0.009362
4 -0.002592 -0.031988 -0.046641
# Box Plot
import seaborn as sns
sns.boxplot(df_diabetics['bmi'])
<Axes: >
# Outlier positions (bmi > 0.10)
print(np.where(df_diabetics['bmi']>0.10))
(array([ 32, 114, 138, 145, 256, 262, 327, 332, 362, 366, 367, 405]),)
df_diabetics['bmi']
0 0.061696
1 -0.051474
2 0.044451
3 -0.011595
4 -0.036385
...
437 0.019662
438 -0.015906
439 -0.015906
440 0.039062
441 -0.073030
Name: bmi, Length: 442, dtype: float64
Q1 = df_diabetics['bmi'].quantile(0.25)
Q3 = df_diabetics['bmi'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR
# index positions of values above the upper bound and below the lower bound
upper_array = np.where(df_diabetics['bmi'] >= upper)[0]
lower_array = np.where(df_diabetics['bmi'] <= lower)[0]
# remove the outliers
df_diabetics.drop(index=upper_array, inplace=True)
df_diabetics.drop(index=lower_array, inplace=True)
sns.boxplot(df_diabetics['bmi'])
<Axes: >
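The IQR rule used above can be sanity-checked on a tiny, hypothetical series:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 100])  # 100 is an obvious outlier

# Same bounds as in the diabetes example: Q1/Q3 -/+ 1.5 * IQR
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(lower, upper)       # 8.75 14.75
print(outliers.tolist())  # [100]
```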
Handling wrong formats or mismatched data types in Python means identifying values whose format or type is incorrect and converting them so they fit the needs of the analysis or modeling. Wrong formats or types can arise from input errors, conversion errors, or other problems in the data-collection process. Handling them matters because it ensures the integrity of the data and the reliability of the analysis.
Below are common mistakes and how to handle them.
df = pd.DataFrame({'quantity':['100','250']})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   quantity  2 non-null      object
dtypes: object(1)
memory usage: 144.0+ bytes
df['quantity'] = df['quantity'].astype(int)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   quantity  2 non-null      int64
dtypes: int64(1)
memory usage: 144.0 bytes
df = pd.DataFrame({'date':['20220810','20220815']})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   date    2 non-null      object
dtypes: object(1)
memory usage: 144.0+ bytes
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   date    2 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 144.0 bytes
df
| | date |
|---|---|
| 0 | 2022-08-10 |
| 1 | 2022-08-15 |
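When some values do not match the expected format, `pd.to_datetime` raises an error by default; passing `errors='coerce'` turns unparsable values into `NaT` instead, so they can then be handled like any other missing value. A small sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'date': ['20220810', 'not-a-date']})
# 'coerce' converts values that do not match the format into NaT
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')

print(df['date'].isna().tolist())  # [False, True]
```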
df = pd.DataFrame({'product_name':['ICE TEA','Ice Tea','ice tea']})
print(df)
  product_name
0      ICE TEA
1      Ice Tea
2      ice tea
# convert everything to upper case
df['product_name'] = df['product_name'].str.upper()
print(df)
  product_name
0      ICE TEA
1      ICE TEA
2      ICE TEA
# convert everything to lower case
df['product_name'] = df['product_name'].str.lower()
print(df)
  product_name
0      ice tea
1      ice tea
2      ice tea
# capitalize only the first letter
df['product_name'] = df['product_name'].apply(lambda x: x.capitalize())
print(df)
  product_name
0      Ice tea
1      Ice tea
2      Ice tea
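To capitalize the first letter of every word (rather than only the first letter of the string), pandas also provides `str.title()`:

```python
import pandas as pd

s = pd.Series(['ice tea', 'rainbow cake'])
titled = s.str.title()
print(titled.tolist())  # ['Ice Tea', 'Rainbow Cake']
```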
df = pd.DataFrame({'product_name':[' ICE TEA ','ICE COFFEE ',' RAINBOW CAKE']})
print(df)
     product_name
0       ICE TEA  
1    ICE COFFEE  
2    RAINBOW CAKE
df['product_name'] = df['product_name'].str.strip()
print(df)
   product_name
0       ICE TEA
1    ICE COFFEE
2  RAINBOW CAKE
df = pd.DataFrame({'product_name':['ICE TEA (SOLD PER 1 GLASS)','RAINBOW CAKE (SOLD PER 1 SLICE)']})
print(df)
                      product_name
0       ICE TEA (SOLD PER 1 GLASS)
1  RAINBOW CAKE (SOLD PER 1 SLICE)
df['product_name'] = df['product_name'].str.replace(r' \(SOLD.*', '', regex=True)
print(df)
   product_name
0       ICE TEA
1  RAINBOW CAKE
df = pd.DataFrame({'address':['jln. harapan','jl. baru']})
print(df)
        address
0  jln. harapan
1      jl. baru
df['address'] = df['address'].str.replace(r'jl.* ', 'jalan ', regex=True)
print(df)
         address
0  jalan harapan
1     jalan baru
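One caveat: the pattern `r'jl.* '` is greedy, so on longer street names it consumes everything up to the last space. A stricter, anchored variant only replaces the "jl."/"jln." prefix itself; the sketch below (with one hypothetical multi-word address) shows the difference:

```python
import pandas as pd

s = pd.Series(['jln. harapan', 'jl. gatot subroto'])

# Greedy: 'jl.* ' consumes everything up to the LAST space,
# so part of the street name is lost.
greedy = s.str.replace(r'jl.* ', 'jalan ', regex=True)
print(greedy.tolist())   # ['jalan harapan', 'jalan subroto']

# Anchored: replace only the 'jl.'/'jln.' prefix and following spaces.
strict = s.str.replace(r'^jln?\.\s*', 'jalan ', regex=True)
print(strict.tolist())   # ['jalan harapan', 'jalan gatot subroto']
```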
Feature scaling is a way of putting the numerical data in a dataset on the same range of values (scale), so that no single variable dominates the others.
In general, the formulas used for feature scaling are standardisation and normalisation. Some types of feature scaling:
- Standardisation
- Min-Max Normalization
- Maximum Absolute Scaling
- Robust Scaling
- Mean Normalization
and others.
Standardisation is a scaling technique in which values are centered around the mean with a standard deviation of one unit: the mean of the attribute becomes zero, and the resulting distribution has a standard deviation of one.
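In formula form, standardisation computes z = (x - mean) / std. A minimal hand computation on hypothetical numbers (note that `StandardScaler` uses the population standard deviation, i.e. ddof=0, which is also NumPy's default):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0])
z = (x - x.mean()) / x.std()   # x.std() defaults to ddof=0, like StandardScaler

print(z)                  # approximately [-1.2247, 0.0, 1.2247]
print(z.mean(), z.std())  # mean becomes 0, std becomes 1
```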
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(palette="rainbow", style="darkgrid")
%matplotlib inline
# Using the dataset from session 3
df = pd.read_csv("https://raw.githubusercontent.com/Lucky77777777/Praktikum-Pengantar-Data-Mining/main/3.%20PDM%3A%20Data%20Cleaning%20(Part%202)%20%26%20Pre-Processing/data%20pertemuan%203.csv", usecols=["Age", "Fare"])
df.fillna(value={"Age": df["Age"].mean()}, inplace=True)
# import StandardScaler from scikit-learn, used for standardisation
from sklearn.preprocessing import StandardScaler
sc = StandardScaler() # create an instance of the scaler
df_new = pd.DataFrame(sc.fit_transform(df), columns=df.columns) # fit and transform with StandardScaler
# compare scatter plots before and after standardisation
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Standardization", fontsize=18)
sns.scatterplot(data = df, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Standardization", fontsize=18)
sns.scatterplot(data = df_new, color="red")
plt.tight_layout()
plt.show()
# compare kde plots before and after standardisation
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("PDF Before Standardization", fontsize=18)
sns.kdeplot(data = df, color="blue")
plt.subplot(1,2,2)
plt.title("PDF After Standardization", fontsize=18)
sns.kdeplot(data = df_new, color="red")
plt.tight_layout()
plt.show()
Notice how the mean of this distribution is very close to 0 and the standard deviation is exactly 1: this is standardisation. There are some outliers in the "Fare" variable, and they are why the mean does not land exactly on 0. Notice also how, in the scatterplot, the scale changes and the distribution becomes centered around 0, while the kde plots of the probability density functions look exactly the same, showing that the shape of the distribution is not affected by standardisation.
Normalisation is a technique often applied as part of data preparation for machine learning. Its goal is to rescale the numeric columns of a dataset to a common scale without distorting differences in the ranges of values or losing information.
Min-max normalisation is one of the most common ways to normalise data. For each feature, the minimum value of that feature is transformed to 0, the maximum value to 1, and every other value to a decimal between 0 and 1.
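Min-max by hand: x' = (x - min) / (max - min). A tiny check on hypothetical numbers:

```python
import numpy as np

x = np.array([10.0, 20.0, 40.0])
x_mm = (x - x.min()) / (x.max() - x.min())

print(x_mm)  # min maps to 0.0, max maps to 1.0, 20 maps to 1/3
```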
# import MinMaxScaler from scikit-learn, used for normalisation
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler() # create an instance of the scaler
df_new_mm = pd.DataFrame(mm.fit_transform(df), columns=df.columns) # fit and transform the dataframe with MinMaxScaler
# compare scatter plots before and after min-max scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Min Max Scaling", fontsize=18)
sns.scatterplot(data = df, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Min Max Scaling", fontsize=18)
sns.scatterplot(data = df_new_mm, color="red")
plt.tight_layout()
plt.show()
# compare kde plots before and after min-max scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("PDF Before Min Max Scaling", fontsize=18)
sns.kdeplot(data = df, color="blue")
plt.subplot(1,2,2)
plt.title("PDF After Min Max Scaling", fontsize=18)
sns.kdeplot(data = df_new_mm, color="red")
plt.tight_layout()
plt.show()
Min-max normalisation works best when the maximum and minimum values are very different and known.
Maximum absolute scaling works by scaling each feature by its maximum absolute value. This estimator scales and translates each feature individually so that the maximum absolute value of each feature in the training set is 1.0. It does not shift or center the data, so it does not destroy sparsity, and it can also be applied to CSR or CSC matrices.
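Max-abs by hand: x' = x / max(|x|), so signs are preserved and the largest magnitude becomes 1.0. A tiny check on hypothetical numbers:

```python
import numpy as np

x = np.array([-4.0, 2.0, 8.0])
x_ma = x / np.abs(x).max()

print(x_ma)  # [-0.5, 0.25, 1.0]
```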
# import MaxAbsScaler from scikit-learn, used for maximum absolute scaling
from sklearn.preprocessing import MaxAbsScaler
ma = MaxAbsScaler() # create an instance of the scaler
df_new_ma = pd.DataFrame(ma.fit_transform(df), columns=df.columns) # fit and transform the dataframe with max absolute scaling
# compare scatter plots before and after max absolute scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Max Absolute Scaling", fontsize=18)
sns.scatterplot(data = df, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Max Absolute Scaling", fontsize=18)
sns.scatterplot(data = df_new_ma, color="red")
plt.tight_layout()
plt.show()
# compare kde plots before and after max absolute scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("PDF Before Max Absolute Scaling", fontsize=18)
sns.kdeplot(data = df, color="blue")
plt.subplot(1,2,2)
plt.title("PDF After Max Absolute Scaling", fontsize=18)
sns.kdeplot(data = df_new_ma, color="red")
plt.tight_layout()
plt.show()
Max absolute scaling works better on sparse data, i.e., when most of the values are 0.
This scaler removes the median and scales the data according to a quantile range (by default the IQR, or interquartile range: the range between the first quartile (25th percentile) and the third quartile (75th percentile)).
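Robust scaling by hand: x' = (x - median) / IQR, so the bulk of the data stays in a small range while the outlier ends up far away. A tiny check on hypothetical numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
q1, q3 = np.percentile(x, [25, 75])
x_rs = (x - np.median(x)) / (q3 - q1)

print(x_rs)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```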
# import RobustScaler from scikit-learn, used for robust scaling
from sklearn.preprocessing import RobustScaler
rs = RobustScaler() # create an instance of the scaler
df_new_rs = pd.DataFrame(rs.fit_transform(df), columns=df.columns) # fit and transform the dataframe with robust scaling
# compare scatter plots before and after robust scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Robust Scaling", fontsize=18)
sns.scatterplot(data = df, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Robust Scaling", fontsize=18)
sns.scatterplot(data = df_new_rs, color="red")
plt.tight_layout()
plt.show()
# compare kde plots before and after robust scaling
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("PDF Before Robust Scaling", fontsize=18)
sns.kdeplot(data = df, color="blue")
plt.subplot(1,2,2)
plt.title("PDF After Robust Scaling", fontsize=18)
sns.kdeplot(data = df_new_rs, color="red")
plt.tight_layout()
plt.show()
Robust scaling works best for data that contains outliers, since the median and IQR it relies on are not distorted by extreme values.
Mean normalization is very similar to min-max scaling, except that the mean is used to center the data: the mean is subtracted from every value and the result is divided by the range (max minus min).
Scikit-learn has no dedicated class for mean normalization, but it is easy to implement with pandas and numpy.
import numpy as np
# Perform mean normalization: subtract the mean, then divide by the range (max - min)
df_normalized = (df - df.mean()) / (df.max() - df.min())
# Wrap the result in a new DataFrame
df_new_mn = pd.DataFrame(df_normalized, columns=df.columns)
# Compare scatter plots before and after mean normalization
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("Scatterplot Before Mean Normalization", fontsize=18)
sns.scatterplot(data = df, color="blue")
plt.subplot(1,2,2)
plt.title("Scatterplot After Mean Normalization", fontsize=18)
sns.scatterplot(data = df_new_mn, color="red")
plt.tight_layout()
plt.show()
# Compare KDE plots before and after mean normalization
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.title("PDF Before Mean Normalization", fontsize=18)
sns.kdeplot(data = df, color="blue")
plt.subplot(1,2,2)
plt.title("PDF After Mean Normalization", fontsize=18)
sns.kdeplot(data = df_new_mn, color="red")
plt.tight_layout()
plt.show()
Below are some basic tips that can be used when scaling features:
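As a rough illustration of how the common scalers behave differently, here is a minimal sketch on a made-up column with one large outlier (the values and the 0/1 bounds check are for illustration only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Toy column with one large outlier so the differences are visible
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

results = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    # fit_transform learns the statistics from x and rescales it in one step
    results[type(scaler).__name__] = scaler.fit_transform(x).ravel()
    print(type(scaler).__name__, np.round(results[type(scaler).__name__], 2))
```

Note how min-max scaling squeezes the non-outlier values close together (the outlier dictates the range), while robust scaling, based on the median and IQR, keeps them spread out.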
Feature encoding is the process of converting categorical features into numeric features. This is necessary because machine learning algorithms can only handle numeric input. There are many different ways to encode categorical features, each with its own strengths and weaknesses. Here we explore some of the most popular methods, namely:
Label encoding converts a categorical variable into numeric form by assigning each category a unique integer label. For example, "red" can be encoded as 0, "green" as 1, and "blue" as 2.
Label encoding is used when:
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
df = sns.load_dataset('tips')
df.head()
| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
df['time'].value_counts()
time
Dinner    176
Lunch      68
Name: count, dtype: int64
df['day'].value_counts()
day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64
# Create a label encoder object
le = LabelEncoder()
# Encode the 'time' column
df['encoded_time'] = le.fit_transform(df['time'])
print(df['encoded_time'].unique())
[0 1]
df['encoded_time'].value_counts()
encoded_time
0    176
1     68
Name: count, dtype: int64
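LabelEncoder assigns integer codes in the sorted order of the categories (which is why Dinner becomes 0 above); `classes_` exposes the mapping and `inverse_transform` reverses it. A minimal sketch on made-up values:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Dinner', 'Lunch', 'Dinner'])
print(le.classes_)                  # categories, in code order: ['Dinner' 'Lunch']
print(codes)                        # [0 1 0]
print(le.inverse_transform(codes))  # back to the original labels
```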
Ordinal encoding is used when a categorical variable has an inherent order or ranking. Each category is assigned a numeric value based on its rank. For example, "low" can be encoded as 0, "medium" as 1, and "high" as 2.
Ordinal encoding is used when:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['Fri' , 'Sat', 'Sun' , 'Thur']])
df['encoded_day'] = oe.fit_transform(df[['day']])
df['encoded_day'].value_counts()
encoded_day
1.0    87
2.0    76
3.0    62
0.0    19
Name: count, dtype: int64
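With OrdinalEncoder, the assigned code is simply the position of the category in the `categories` list you pass in, which explains the counts above. A minimal sketch on a few made-up rows:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

days = pd.DataFrame({'day': ['Sat', 'Fri', 'Sun', 'Thur']})
oe = OrdinalEncoder(categories=[['Fri', 'Sat', 'Sun', 'Thur']])
encoded = oe.fit_transform(days).ravel()
print(oe.categories_)  # position in this list == assigned code
print(encoded)         # [1. 0. 2. 3.]
```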
One-hot encoding represents a categorical variable as a binary vector: each category becomes a new binary column, and the presence or absence of that category is indicated by 1 or 0.
One-hot encoding is used when:
from sklearn.preprocessing import OneHotEncoder
titanic = sns.load_dataset('titanic')
titanic.head()
| survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
# Create an OneHotEncoder instance (without specifying 'sparse')
onehot_encoder = OneHotEncoder()
# Fit and transform the 'embarked' column
embarked_onehot = onehot_encoder.fit_transform(titanic[['embarked']])
# Convert the result into a DataFrame
embarked_onehot_df = pd.DataFrame(embarked_onehot.toarray(), columns=onehot_encoder.get_feature_names_out(['embarked']))
# Concatenate the new DataFrame with the original one
titanic = pd.concat([titanic.reset_index(drop=True), embarked_onehot_df.reset_index(drop=True)], axis=1)
titanic.head()
| survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | embarked_C | embarked_Q | embarked_S | embarked_nan | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True | 0.0 | 0.0 | 1.0 | 0.0 |
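pandas offers `pd.get_dummies` as a lighter-weight alternative to OneHotEncoder for one-off transformations; `dummy_na=True` produces a column for missing values, mirroring the `embarked_nan` column above. A sketch on made-up values:

```python
import pandas as pd

s = pd.Series(['S', 'C', 'S', 'Q', None], name='embarked')
# One binary column per category; dummy_na adds a column for missing values
dummies = pd.get_dummies(s, prefix='embarked', dummy_na=True)
print(dummies.columns.tolist())  # ['embarked_C', 'embarked_Q', 'embarked_S', 'embarked_nan']
```

OneHotEncoder is preferable inside a modeling pipeline because it remembers the fitted categories and can transform unseen data consistently; `get_dummies` is convenient for quick exploration.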
Discretization reduces the set of values of a continuous attribute by dividing its range into intervals. The discretization process generally consists of four stages:
Discretizing continuous features can improve the performance of models such as decision trees and Naive Bayes. With discrete values, training is faster, for example in a decision tree that must consider every feature value. Discretization also makes the data easier to understand and reduces the influence of outliers: by grouping values into intervals, a skewed distribution is spread more evenly. Overall, discretization simplifies the data, speeds up learning, and can improve accuracy.
There are generally five methods for discretizing continuous attributes: binning, cluster analysis, histogram analysis, entropy-based discretization, and segmentation by "natural partitioning". This module focuses on binning.
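Within binning, the difference between equal-width bins (`pd.cut`) and equal-frequency bins (`pd.qcut`) is easy to see on a made-up series with one outlier:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
# pd.cut splits the value *range* into bins of equal width,
# so the outlier claims a bin almost entirely to itself
equal_width = pd.cut(values, bins=3)
# pd.qcut puts (roughly) the same number of rows in each bin
equal_freq = pd.qcut(values, q=3)
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```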
# Unique values of the age feature in the titanic data
titanic1 = titanic.copy()
titanic1.age.value_counts().index.sort_values()
Index([0.42, 0.67, 0.75, 0.83, 0.92, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0,
8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 14.5, 15.0, 16.0, 17.0, 18.0,
19.0, 20.0, 20.5, 21.0, 22.0, 23.0, 23.5, 24.0, 24.5, 25.0, 26.0, 27.0,
28.0, 28.5, 29.0, 30.0, 30.5, 31.0, 32.0, 32.5, 33.0, 34.0, 34.5, 35.0,
36.0, 36.5, 37.0, 38.0, 39.0, 40.0, 40.5, 41.0, 42.0, 43.0, 44.0, 45.0,
45.5, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 55.5,
56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 70.0,
70.5, 71.0, 74.0, 80.0],
dtype='float64', name='age')
# Simplify the age feature into 4 age groups
titanic1["age_grup1"] = pd.qcut(x = titanic1['age'], q = 4)
titanic1["age_grup2"] = pd.qcut(x = titanic1['age'], q = 4, labels = ['Child', 'Young Adults', 'Middle-age Adults', 'Old Adults']) # add labels
# The labels follow the order of the intervals
print(titanic1['age_grup1'].value_counts())
print(titanic1['age_grup2'].value_counts())
# Specify the value range of each interval explicitly
titanic1["age_grup3"] = pd.cut(x = titanic1['age'], bins = [0, 15, 30, 45, 100],
labels = ['Child', 'Young Adults', 'Middle-age Adults', 'Old Adults'])
titanic1[['age', 'age_grup1', 'age_grup2', 'age_grup3']].head(15)
age_grup1
(20.125, 28.0]     183
(0.419, 20.125]    179
(38.0, 80.0]       177
(28.0, 38.0]       175
Name: count, dtype: int64
age_grup2
Young Adults         183
Child                179
Old Adults           177
Middle-age Adults    175
Name: count, dtype: int64
| age | age_grup1 | age_grup2 | age_grup3 | |
|---|---|---|---|---|
| 0 | 22.0 | (20.125, 28.0] | Young Adults | Young Adults |
| 1 | 38.0 | (28.0, 38.0] | Middle-age Adults | Middle-age Adults |
| 2 | 26.0 | (20.125, 28.0] | Young Adults | Young Adults |
| 3 | 35.0 | (28.0, 38.0] | Middle-age Adults | Middle-age Adults |
| 4 | 35.0 | (28.0, 38.0] | Middle-age Adults | Middle-age Adults |
| 5 | NaN | NaN | NaN | NaN |
| 6 | 54.0 | (38.0, 80.0] | Old Adults | Old Adults |
| 7 | 2.0 | (0.419, 20.125] | Child | Child |
| 8 | 27.0 | (20.125, 28.0] | Young Adults | Young Adults |
| 9 | 14.0 | (0.419, 20.125] | Child | Child |
| 10 | 4.0 | (0.419, 20.125] | Child | Child |
| 11 | 58.0 | (38.0, 80.0] | Old Adults | Old Adults |
| 12 | 20.0 | (0.419, 20.125] | Child | Young Adults |
| 13 | 39.0 | (38.0, 80.0] | Old Adults | Middle-age Adults |
| 14 | 14.0 | (0.419, 20.125] | Child | Child |
fig, axes = plt.subplots(2, 2, figsize=(12, 12))
sns.histplot(x = 'age', kde = True, data = titanic1, ax=axes[0, 0])
axes[0, 0].set_title('Age')
sns.countplot(x='age_grup3', data=titanic1, ax=axes[0, 1])
axes[0, 1].set_title('Age Group3')
sns.countplot(x = 'age_grup1', data = titanic1, ax=axes[1, 0])
axes[1, 0].set_title('Age Group1')
sns.countplot(x='age_grup2', data=titanic1, ax=axes[1,1])
axes[1,1].set_title('Age Group2')
plt.tight_layout()
In feature construction, new features are built from existing attributes and added alongside them, helping to improve accuracy and the understanding of structure in high-dimensional data. For example:
# Using the dataset from session 3
titanic2 = pd.read_csv("https://raw.githubusercontent.com/Lucky77777777/Praktikum-Pengantar-Data-Mining/main/3.%20PDM%3A%20Data%20Cleaning%20(Part%202)%20%26%20Pre-Processing/data%20pertemuan%203.csv")
titanic2.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
PassengerId is the unique id of the row and has no effect on the target. Survived is the target variable we are trying to predict (0 or 1). Pclass (passenger class) is the socio-economic status of the passenger; it is a categorical ordinal feature with 3 unique values (1, 2, or 3). Name, Sex, and Age are self-explanatory. SibSp is the total number of the passenger's siblings and spouses. Parch is the total number of the passenger's parents and children. Ticket is the ticket number of the passenger. Fare is the passenger fare. Cabin is the cabin number of the passenger. Embarked is the port of embarkation; it is a categorical feature with 3 unique values (C, Q, or S).
# Construct a family-size feature
titanic2['FamilySize'] = titanic2['SibSp'] + titanic2['Parch'] + 1
# Is Alone?
titanic2['Alone'] = 0
titanic2.loc[titanic2['FamilySize'] == 1, 'Alone'] = 1
# Title
titanic2['Title'] = titanic2['Name'].str.split(', ', expand = True)[1].str.split('.', expand=True)[0]
print(titanic2.Title.value_counts())
titanic2.head(10)
Title
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
the Countess      1
Capt              1
Ms                1
Sir               1
Lady              1
Mme               1
Don               1
Jonkheer          1
Name: count, dtype: int64
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | FamilySize | Alone | Title | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 | 0 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 2 | 0 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 1 | 1 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 2 | 0 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 1 | 1 | Mr |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | 1 | 1 | Mr |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | 1 | 1 | Mr |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S | 5 | 0 | Master |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S | 3 | 0 | Mrs |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C | 2 | 0 | Mrs |
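The `Title` counts above show a long tail of rare titles. A common follow-up (not shown in the original, and with an arbitrary threshold chosen purely for illustration) is to collapse rare titles into a single 'Other' category, sketched here on made-up values:

```python
import pandas as pd

titles = pd.Series(['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Mrs', 'Dr', 'Rev', 'Capt'])
counts = titles.value_counts()
rare = counts[counts < 2].index                 # threshold of 2 is arbitrary, for illustration
collapsed = titles.replace(list(rare), 'Other') # map every rare title to 'Other'
print(collapsed.value_counts())
```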
Dimensionality reduction is the process of reducing the number of features in a dataset while retaining as much information as possible. It helps reduce model complexity, improve the performance of learning algorithms, save computation time, and make data easier to visualize. By reducing dimensionality we can also address issues such as correlated variables and avoid overfitting. Two commonly used approaches are PCA and feature selection.
Principal Component Analysis (PCA) is one of the leading linear techniques for dimensionality reduction. It maps the data directly onto a lower-dimensional space in a way that maximizes the variance of the data in the low-dimensional representation.
Essentially, it is a statistical procedure that orthogonally transforms the 'n' original coordinates of a dataset into a new set of n coordinates known as principal components. The first principal component captures the maximum variance; each subsequent component carries the highest possible remaining variance subject to being orthogonal to (uncorrelated with) the previous components.
PCA is sensitive to the relative scaling of the original variables, so the data columns should be normalized before applying it. Another thing to keep in mind is that PCA makes the dataset lose interpretability; if interpretability matters for your analysis, PCA is not the right dimensionality-reduction method to use.
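Since PCA is scale-sensitive, a common pattern (an assumed sketch on made-up data, not part of the original example) is to standardize inside a scikit-learn Pipeline so the scaling is always applied before the projection:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Made-up data where the second column has a much larger scale
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 1000.0]])

# Standardize first, then project onto 2 principal components
pipe = Pipeline([('scale', StandardScaler()), ('pca', PCA(n_components=2))])
X_2d = pipe.fit_transform(X)
print(X_2d.shape)  # (4, 2)
```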
Principal Component Analysis (PCA) only works on numeric data. The steps are:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
import seaborn as sns
import numpy as np
import pandas as pd
sns.set()
import matplotlib.pyplot as plt
import os
import missingno as msno
data = sns.load_dataset('iris')
data.head()
| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
# Label encoding (convert categories to numeric)
data['species'] = data['species'].replace({'setosa': 0, 'versicolor': 1, 'virginica': 2})
data.head()
| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
# Bar chart of non-missing values per column (missingno)
msno.bar(data)
sns.countplot(y=data.species ,data=data)
plt.xlabel("Count of each Target class")
plt.ylabel("Target classes")
plt.show()
# Individual box plots for each feature, grouped by class
fig,ax = plt.subplots(nrows = 2, ncols=2, figsize=(10,8))
row = 0
col = 0
for i in range(len(data.columns) -1):
if col > 1:
row += 1
col = 0
axes = ax[row,col]
sns.boxplot(x = data['species'], y = data[data.columns[i]],ax = axes)
col += 1
fig.suptitle("Individual Features by Class")
plt.tight_layout()
plt.show()
sns.pairplot(data, hue = 'species')
# heatmap plot for the correlation
plt.figure(figsize=(10,10))
sns.heatmap(data.corr(), annot=True,cmap='viridis')
# Histograms of the feature distributions
data.hist(figsize=(12,10),bins = 15)
plt.title("Features Distribution")
plt.show()
X = data.drop(['species'], axis=1)
y = data.species
pca = PCA()
X_new = pca.fit_transform(X)
Determine the optimal number of principal components
pca.get_covariance()
array([[ 0.686, -0.042, 1.274, 0.516],
[-0.042, 0.19 , -0.33 , -0.122],
[ 1.274, -0.33 , 3.116, 1.296],
[ 0.516, -0.122, 1.296, 0.581]])
# getting variance ratio
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
[0.925 0.053 0.017 0.005]
# plot with individual explained variance and principal components
with plt.style.context('dark_background'):
plt.figure(figsize=(6, 4))
plt.bar(range(4), explained_variance, alpha=0.5, align='center',
label='individual explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
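A common way to pick the number of components is the cumulative explained variance; the 0.95 threshold below is an assumed rule of thumb, not something fixed by the method:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cumulative, 3))
# keep the smallest number of components that explains >= 95% of the variance
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components)
```

On the iris data the first two components already pass the 95% mark, so stricter thresholds (or other considerations, such as downstream model performance) would be needed to justify keeping three.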
Using 3 principal components
pca = PCA(n_components=3)
X_new = pca.fit_transform(X)
# Visualising
from matplotlib.colors import ListedColormap
X_set, y_set = X_new, y
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
colors = ('red', 'green', 'blue')
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = colors[i], label = j)
plt.title('PCA')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
# heatmap plot for the correlation
plt.figure(figsize=(8,8))
sns.heatmap(pd.DataFrame(X_new).corr(), annot = True, cmap = 'viridis')
The features used to train a machine learning model have a large impact on the performance it can achieve. Irrelevant or only partially relevant features can hurt model performance, affecting both accuracy and training speed. Feature selection is the process of automatically choosing the features that contribute most to the prediction variable or desired output. Its goal is to find the best set (combination) of features for building an optimized model of the phenomenon being studied. By reducing overfitting, improving accuracy, and cutting training time, feature selection improves both the quality and the efficiency of the resulting model.
In general, feature selection methods are grouped into filter, wrapper, and embedded methods.
This module covers several feature selection techniques: Variance Threshold, ANOVA, and Mutual Information.
Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance does not meet a given threshold; by default it removes all zero-variance features, i.e. features that have the same value in every sample. The method assumes that higher-variance features may contain more useful information, but note that it does not account for relationships between features, or between features and the target, which is a weakness of filter methods in general.
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
# Toy dataset with redundant and constant features
X, y = make_classification(
n_samples=1000,
n_features=10,
n_classes=2,
random_state=10,
)
X = pd.DataFrame(X)
display(X.head())
# Add constant features
X[[0, 5, 9]] = 1
display(X.head())
# To remove constant features
sel = VarianceThreshold(threshold = 0)
# fit finds the features with zero variance
X_t = pd.DataFrame(sel.fit_transform(X))
display(X_t)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.005838 | -0.376539 | -0.620180 | -0.157567 | -1.120805 | -0.589091 | -1.574578 | 1.678046 | 1.080180 | 0.353587 |
| 1 | 0.411180 | 0.762409 | -0.784210 | -0.096479 | -0.408758 | -0.665780 | 0.210942 | -0.850449 | -0.461301 | 1.062237 |
| 2 | -1.525408 | 2.227934 | 0.547727 | -0.341481 | -0.817577 | 0.423091 | -2.663678 | 2.440042 | 1.698919 | -0.705302 |
| 3 | -1.374563 | 0.061129 | -0.995868 | -0.214351 | -0.558957 | 0.064870 | -2.149167 | 2.294192 | -1.383965 | 0.806924 |
| 4 | -0.549798 | 0.046349 | 0.834756 | -0.104845 | -0.455528 | -0.410938 | -0.911018 | 0.898098 | 1.068259 | 0.384683 |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -0.376539 | -0.620180 | -0.157567 | -1.120805 | 1 | -1.574578 | 1.678046 | 1.080180 | 1 |
| 1 | 1 | 0.762409 | -0.784210 | -0.096479 | -0.408758 | 1 | 0.210942 | -0.850449 | -0.461301 | 1 |
| 2 | 1 | 2.227934 | 0.547727 | -0.341481 | -0.817577 | 1 | -2.663678 | 2.440042 | 1.698919 | 1 |
| 3 | 1 | 0.061129 | -0.995868 | -0.214351 | -0.558957 | 1 | -2.149167 | 2.294192 | -1.383965 | 1 |
| 4 | 1 | 0.046349 | 0.834756 | -0.104845 | -0.455528 | 1 | -0.911018 | 0.898098 | 1.068259 | 1 |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|
| 0 | -0.376539 | -0.620180 | -0.157567 | -1.120805 | -1.574578 | 1.678046 | 1.080180 |
| 1 | 0.762409 | -0.784210 | -0.096479 | -0.408758 | 0.210942 | -0.850449 | -0.461301 |
| 2 | 2.227934 | 0.547727 | -0.341481 | -0.817577 | -2.663678 | 2.440042 | 1.698919 |
| 3 | 0.061129 | -0.995868 | -0.214351 | -0.558957 | -2.149167 | 2.294192 | -1.383965 |
| 4 | 0.046349 | 0.834756 | -0.104845 | -0.455528 | -0.911018 | 0.898098 | 1.068259 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | -0.056868 | -0.336122 | -0.700527 | 0.129147 | -1.850227 | -0.770335 | -0.888264 |
| 996 | 0.566137 | 0.627820 | -0.049620 | 1.765735 | 0.046187 | -0.337826 | -0.161875 |
| 997 | 0.743103 | 1.136757 | -0.103349 | -0.769979 | -2.125616 | 2.847156 | -0.911452 |
| 998 | -1.433237 | -2.549200 | 0.134018 | 1.092140 | 0.197284 | 0.397779 | -0.716400 |
| 999 | 0.730112 | -1.878886 | 0.501490 | -0.411241 | 1.205595 | 0.741542 | 0.308865 |
1000 rows × 7 columns
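Note that `fit_transform` returns a bare array, so the original column labels are lost (the output above is relabeled 0..6). To see which columns survived, `get_support()` returns a boolean mask over the input columns; a minimal sketch on a made-up frame:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

X = pd.DataFrame({'a': [1, 1, 1], 'b': [1, 2, 3], 'c': [0, 0, 0]})
sel = VarianceThreshold(threshold=0)
sel.fit(X)
kept = X.columns[sel.get_support()].tolist()
print(kept)  # ['b'] -- the constant columns 'a' and 'c' are dropped
```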
ANOVA is a parametric statistical hypothesis test for determining whether the means of two or more data samples (often three or more) come from the same distribution. It uses the F statistic (F-test), which computes a ratio of variance values, such as the variance of two different samples, or the variance explained versus unexplained by a statistical test.
ANOVA is typically used when one variable is numeric and the other is categorical, such as a numeric input variable and a (categorical) classification target. The test results can be used for feature selection: features that are independent of the target (i.e. provide no significant information) can be removed from the dataset.
import pandas as pd
import matplotlib.pyplot as plt
from numpy import set_printoptions
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
# load dataset
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target
kolom = X.columns
display(X.head())
# Rank and select features
sel = SelectKBest(score_func = f_classif, k = 10) # k is the number of features to be selected
X_t = sel.fit_transform(X, y)
# summarize scores
set_printoptions(precision=3)
print('feature importance: ', sel.scores_)
# P-values
print('pvalues: ', sel.pvalues_)
| mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 30 columns
feature importance: [6.470e+02 1.181e+02 6.972e+02 5.731e+02 8.365e+01 3.132e+02 5.338e+02 8.617e+02 6.953e+01 9.346e-02 2.688e+02 3.909e-02 2.539e+02 2.437e+02 2.558e+00 5.325e+01 3.901e+01 1.133e+02 2.412e-02 3.468e+00 8.608e+02 1.496e+02 8.979e+02 6.616e+02 1.225e+02 3.043e+02 4.367e+02 9.644e+02 1.189e+02 6.644e+01]
pvalues: [8.466e-096 4.059e-025 8.436e-101 4.735e-088 1.052e-018 3.938e-056 9.967e-084 7.101e-116 5.733e-016 7.599e-001 9.739e-050 8.433e-001 1.652e-047 5.896e-046 1.103e-001 9.976e-013 8.260e-010 3.072e-024 8.766e-001 6.307e-002 8.482e-116 1.078e-030 5.771e-119 2.829e-097 6.575e-026 7.070e-055 2.465e-072 1.969e-124 2.951e-025 2.316e-015]
feature_scores = list(zip(sel.scores_, kolom))
sorted_feature_scores = sorted(feature_scores, reverse=True)
num_list = []
col_list = []
for i in range(len(feature_scores)):
    num_list.append(sorted_feature_scores[i][0])
    col_list.append(sorted_feature_scores[i][1])
# 10 best features
plt.bar(col_list[0:10],num_list[0:10])
plt.xticks(rotation = 90)
(Output: a bar chart of the 10 highest F-values, with x-tick labels 'worst concave points', 'worst perimeter', 'mean concave points', 'worst radius', 'mean perimeter', 'worst area', 'mean radius', 'mean area', 'mean concavity', 'worst concavity'.)
result = pd.DataFrame({'Feature': kolom, 'F-value': sel.scores_, 'P-value': sel.pvalues_})
result['sig'] = (result['P-value'] < 0.05).astype(int) # 1 if significant at the 0.05 level
result = result.sort_values(by='F-value', ascending=False).reset_index(drop = True)
result
| Feature | F-value | P-value | sig | |
|---|---|---|---|---|
| 0 | worst concave points | 964.385393 | 1.969100e-124 | 1 |
| 1 | worst perimeter | 897.944219 | 5.771397e-119 | 1 |
| 2 | mean concave points | 861.676020 | 7.101150e-116 | 1 |
| 3 | worst radius | 860.781707 | 8.482292e-116 | 1 |
| 4 | mean perimeter | 697.235272 | 8.436251e-101 | 1 |
| 5 | worst area | 661.600206 | 2.828848e-97 | 1 |
| 6 | mean radius | 646.981021 | 8.465941e-96 | 1 |
| 7 | mean area | 573.060747 | 4.734564e-88 | 1 |
| 8 | mean concavity | 533.793126 | 9.966556e-84 | 1 |
| 9 | worst concavity | 436.691939 | 2.464664e-72 | 1 |
| 10 | mean compactness | 313.233079 | 3.938263e-56 | 1 |
| 11 | worst compactness | 304.341063 | 7.069816e-55 | 1 |
| 12 | radius error | 268.840327 | 9.738949e-50 | 1 |
| 13 | perimeter error | 253.897392 | 1.651905e-47 | 1 |
| 14 | area error | 243.651586 | 5.895521e-46 | 1 |
| 15 | worst texture | 149.596905 | 1.078057e-30 | 1 |
| 16 | worst smoothness | 122.472880 | 6.575144e-26 | 1 |
| 17 | worst symmetry | 118.860232 | 2.951121e-25 | 1 |
| 18 | mean texture | 118.096059 | 4.058636e-25 | 1 |
| 19 | concave points error | 113.262760 | 3.072309e-24 | 1 |
| 20 | mean smoothness | 83.651123 | 1.051850e-18 | 1 |
| 21 | mean symmetry | 69.527444 | 5.733384e-16 | 1 |
| 22 | worst fractal dimension | 66.443961 | 2.316432e-15 | 1 |
| 23 | compactness error | 53.247339 | 9.975995e-13 | 1 |
| 24 | concavity error | 39.014482 | 8.260176e-10 | 1 |
| 25 | fractal dimension error | 3.468275 | 6.307355e-02 | 0 |
| 26 | smoothness error | 2.557968 | 1.102966e-01 | 0 |
| 27 | mean fractal dimension | 0.093459 | 7.599368e-01 | 0 |
| 28 | texture error | 0.039095 | 8.433320e-01 | 0 |
| 29 | symmetry error | 0.024117 | 8.766418e-01 | 0 |
Mutual Information is a metric that measures how dependent two random variables are on each other. Its value is always non-negative, and it is zero if and only if the two random variables are completely independent. The higher the Mutual Information, the stronger the dependence between the two variables.
For example, given two random variables X and Y, the Mutual Information between them can be computed with the formula:
$$ I(X ; Y) = H(X) - H(X \mid Y) $$
where $I(X; Y)$ is the Mutual Information between X and Y, $H(X)$ is the entropy of X, and $H(X \mid Y)$ is the conditional entropy of X given Y. With base-2 logarithms, the result is in bits.
One way to understand Mutual Information is through the concept of entropy. Entropy describes how random or disordered a random variable is. Mutual Information is the reduction in that uncertainty: it quantifies how much the entropy of one variable shrinks once the other variable is known.
In simple terms, Mutual Information is how much information one variable provides about another. If two variables are strongly dependent, their Mutual Information is high; if they are independent, it is zero.
In practice, Mutual Information is often used in classification or regression problems to evaluate how important a feature is with respect to the target variable. The higher the Mutual Information between a feature and the target, the more important that feature is for predicting the target.
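The identity $I(X;Y) = H(X) - H(X \mid Y)$ can be verified by hand on a tiny made-up pair of discrete variables (the arrays below are purely illustrative):

```python
import numpy as np

# Hypothetical example: knowing Y narrows down X, but not completely
x = np.array([0, 0, 1, 1, 2, 2])
y = np.array([0, 0, 1, 1, 1, 1])

def entropy_bits(labels):
    # H = -sum p * log2(p) over the empirical distribution
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy_bits(x, y):
    # H(X|Y) = sum over values v of p(Y=v) * H(X | Y=v)
    h = 0.0
    for v in np.unique(y):
        mask = y == v
        h += mask.mean() * entropy_bits(x[mask])
    return h

mi = entropy_bits(x) - conditional_entropy_bits(x, y)
print(f"I(X;Y) = {mi:.3f} bits")  # → I(X;Y) = 0.918 bits
```

Here $H(X) = \log_2 3 \approx 1.585$ bits and $H(X \mid Y) = 2/3$ bits, so observing Y removes about 0.918 bits of uncertainty about X. The scikit-learn estimators used below approximate the same quantity (in nats) for continuous features.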
# load dataset
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target
# Label encoding for categoricals
for colname in X.select_dtypes("object"):
    X[colname], _ = X[colname].factorize()
# All discrete features should now have integer dtypes (double-check this before using MI!)
discrete_features = X.dtypes == int
from sklearn.feature_selection import mutual_info_classif

def make_mi_scores(X, y, discrete_features):
    # The target here is categorical (malignant/benign), so use the
    # classification variant of the MI estimator
    mi_scores = mutual_info_classif(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores
mi_scores = make_mi_scores(X, y, discrete_features)
mi_scores[::3] # show a few features with their MI scores
def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")
plt.figure(dpi=100, figsize=(8, 5))
plot_mi_scores(mi_scores)
# plt.scatter(y, X['worst perimeter'])
plt.scatter(X['worst perimeter'], X['worst area'], c=y, cmap='viridis')
plt.colorbar(label='Variabel Target (y)')
plt.xlabel('Worst Perimeter')
plt.ylabel('Worst Area')
(Output: scatter plot of worst perimeter vs. worst area, colored by the target variable.)
Backward elimination is an iterative method that starts with all features and, at each iteration, removes the least significant feature, i.e. the one whose removal improves model performance the most (or degrades it the least). The process is repeated until removing a feature no longer helps, or until a specified minimum number of features is reached.
The procedure starts with the full feature set. At each step, it removes the worst remaining feature from the set.
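As a rough sketch (not the mlxtend-based code used below), the iterative procedure can be written as an explicit loop; `GaussianNB` and the early stop at 28 features are illustrative choices to keep the sketch fast:

```python
# Minimal backward-elimination sketch with cross-validated accuracy.
# GaussianNB and target_n are illustrative; the notebook itself uses
# mlxtend's SequentialFeatureSelector with a RandomForestClassifier.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
features = list(range(X.shape[1]))  # start with all 30 feature indices
model = GaussianNB()
target_n = 28  # stop early to keep the sketch quick; the full run goes down to 5

while len(features) > target_n:
    # Score every candidate subset formed by dropping one remaining feature
    scores = {
        f: cross_val_score(model, X[:, [g for g in features if g != f]], y, cv=3).mean()
        for f in features
    }
    # Drop the feature whose removal gives the best (or least bad) CV score
    to_drop = max(scores, key=scores.get)
    features.remove(to_drop)
    print(f"dropped feature {to_drop}: {len(features)} left, score {scores[to_drop]:.4f}")
```

Each pass costs one cross-validation per remaining feature, which is why library implementations such as `SequentialFeatureSelector` below log one scored step per removed feature.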
# load dataset
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import numpy as np
sfs1 = SFS(RandomForestClassifier(),
           k_features=5,
           forward=False,  # backward elimination; set True for forward selection
           floating=False,
           verbose=2,
           scoring='accuracy',
           cv=3)
sfs1 = sfs1.fit(np.array(X), y)
[2024-03-14 02:30:00] Features: 29/5 -- score: 0.9666388192703982
[2024-03-14 02:30:16] Features: 28/5 -- score: 0.9666295368049754
[2024-03-14 02:30:31] Features: 27/5 -- score: 0.9701475912002228
[2024-03-14 02:30:46] Features: 26/5 -- score: 0.9666202543395525
[2024-03-14 02:31:00] Features: 25/5 -- score: 0.9666202543395525
[2024-03-14 02:31:13] Features: 24/5 -- score: 0.9701383087347999
[2024-03-14 02:31:25] Features: 23/5 -- score: 0.9683746403044649
[2024-03-14 02:31:37] Features: 22/5 -- score: 0.9701475912002228
[2024-03-14 02:31:48] Features: 21/5 -- score: 0.9701383087347999
[2024-03-14 02:31:59] Features: 20/5 -- score: 0.9701475912002228
[2024-03-14 02:32:10] Features: 19/5 -- score: 0.9771744175252947
[2024-03-14 02:32:19] Features: 18/5 -- score: 0.971901977165135
[2024-03-14 02:32:29] Features: 17/5 -- score: 0.9754107490949596
[2024-03-14 02:32:37] Features: 16/5 -- score: 0.9736470806646246
[2024-03-14 02:32:45] Features: 15/5 -- score: 0.9754014666295369
[2024-03-14 02:32:52] Features: 14/5 -- score: 0.9701383087347999
[2024-03-14 02:32:59] Features: 13/5 -- score: 0.9718926946997123
[2024-03-14 02:33:06] Features: 12/5 -- score: 0.9736470806646246
[2024-03-14 02:33:11] Features: 11/5 -- score: 0.9736377981992016
[2024-03-14 02:33:17] Features: 10/5 -- score: 0.975392184164114
[2024-03-14 02:33:22] Features: 9/5 -- score: 0.9701197438039544
[2024-03-14 02:33:26] Features: 8/5 -- score: 0.977155852594449
[2024-03-14 02:33:30] Features: 7/5 -- score: 0.9736377981992016
[2024-03-14 02:33:33] Features: 6/5 -- score: 0.9701011788731088
[2024-03-14 02:33:36] Features: 5/5 -- score: 0.9718741297688666
# selected features
X.columns[list(sfs1.k_feature_idx_)]
Index(['mean texture', 'mean smoothness', 'worst area', 'worst smoothness',
'worst concavity'],
dtype='object')
The wine dataset is the result of a chemical analysis of wines grown in the same region of Italy by three different cultivators. It contains thirteen different measurements (13 features) of components found in the three types of wine.
Data Set Characteristics:
:Number of Instances: 178
:Number of Attributes: 13 numeric, predictive attributes and the class
:Attribute Information:
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
- class:
- class_0
- class_1
- class_2
Using this dataset, explore the data on your own and apply the various data preprocessing techniques creatively!
from sklearn.datasets import load_wine
# Load the Wine dataset
wine = load_wine()
# Load the dataset's features into X
X = pd.DataFrame(wine.data, columns=wine.feature_names)
# Load the dataset's labels into y
y = wine.target
X.head()
| alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |
y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2])